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A Heterogeneous High Dimensional 
Approximate Nearest Neighbor Algorithm 

Abstract 

We consider the problem of finding high dimensional approximate nearest neighbors. Suppose 
there are d independent rare features, each having its own independent statistics. A point x will have 
Xi = denote the absence of feature i, and Xi = 1 its existence. Sparsity means that usually Xi = 0. 
Distance between points is a variant of the Hamming distance. Dimensional reduction converts the sparse 
heterogeneous problem into a lower dimensional full homogeneous problem. However we will see that 
the converted problem can be much harder to solve than the original problem. Instead we suggest a 
direct approach. It consists of T tries. In try t we rearrange the coordinates in decreasing order of 

/ -i x Pi.ll , 1 , 1N . 

(l-r M ) ■ In (1) 

Pi,01+Pi,10 Pi,l* 

where < r t .i < 1 are uniform pseudo-random numbers, and the p's are the coordinate's statistical 
parameters. The points are lexicographically ordered, and each is compared to its neighbors in that 
order. 

We analyze a generalization of this algorithm, show that it is optimal in some class of algorithms, 
and estimate the necessary number of tries to success. It is governed by an information like function, 
which we call bucketing forest information. Any doubts whether it is "information" are dispelled by 
another paper, where unrestricted bucketing information is defined. 



I. Introduction 

Suppose we have two bags of points, X and Xy, randomly distributed in a high-dimensional 
space. The points are independent of each other, with one exception: there is one unknown point 
xq in bag X that is significantly closer to an unknown point xy in bag X\ than would be 
accounted for by chance. We want an efficient algorithm for quickly finding these two 'paired' 
points. 

The reader might wonder why we need two sets, instead of working as usual with X = X UXy. 
We have come a full circle on this issue. The practical problem that got us interested in this 
theory involved texts from two languages, hence two different sets. However it seemed that the 
asymmetry between X and X-y was not important, so we developed a one set theory. Than we 
found out that keeping X ,Xy separate makes thing clearer. 
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(2) 



Let us start with the well known simple homogeneous marginally Bernoulli(l/2) example. 
Suppose X ,Xi C {0, l} d of sizes n ,ni respectively are randomly chosen as independent 
Bernoulli(l/2) variables, with one exception. Choose randomly one point x e X , xor it 
with a random Bernoulli(p) vector and overwrite one randomly chosen x\ e X x . A symmetric 
description is to say that x 0: x 1 i'th bits have the joint probability 

p/2 (l- P )/2 
l-p)/2 p/2 

For some p > 1/2. We assume that we know p. In practice it will have to be estimated. 
Let 

lnM = In n + him - I(P)d (3) 

where 

I(P) = pm(2p) + (l-p)ln(2(l-p)) (4) 

is the mutual information between the special pair's single coordinate values. Information theory 
tells us that we can not hope to pin the special pair down into less than W possibilities, but 
can come close to it in some asymptotic sense. Assume that W is small. How can we find the 
closest pair? The trivial way to do it is to compare all the n m pairs. A better way has been 
known for a long time. The earliest references I am aware of are Karp,Waarts and Zweig [6], 
Broder [3], Indyk and Motwani [5]. They do not limit themselves to this simplistic problem, 
but their approach clearly handles it. Without restricting generality let n < n±. We randomly 
choose 

k PS log 2 n (5) 

out of the d coordinates, and compare the point pairs which agree on these coordinates (in other 
words, fall into the same bucket). The expected number of comparisons is 

n n 1 2~ k ~ n\ (6) 

while the probability of success of one comparison is p k . In case of failure we try again, with 
other random k coordinates. At first glance it might seem that the expected number of tries 
until success is p ~ k , but that is not true because the attempts are interdependent. The correct 
computation is done in the next section. In the unlimited data case d — > oo indeed 

T « p~ k « n ° g2l/p (7) 
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Is this optimal? Alon [1] has suggested the possibility of improvement by using Hamming's 
perfect code. We have found that in the n = n\ = n case, T fa 

n log 2 l/p can 

be reduced to 

T « n 1/p - 1+e (8) 

for any e > 0, see [7]. Unfortunately this seems hard to convert into a practical algorithm. 

In practice, most approximate nearest neighbor problems are heterogeneous. Coordinates are 
not independent either, but there is a lot to learn from the independent case. For starters let the 
joint probability matrix be position dependent: 

/ 



Pi 



l<i<d (9) 



Pi /2 (l-p<)/2 
\(l- Pl )/2 Pi /2 

This is an important example which we will refer to as the marginally Bernoulli(l/2) example 
It turns out that in each try coordinate i should be chosen with probability 

Pi ~ Pcut 



max 







(10) 



1 - Pcut 

for some cutoff probability < p cut < 1. An intuitive argument leading to that equation 
appears in section [TTTl 

Section [V] presents an independent data model, and a general nearest neighbor algorithm 
using its parameters. Section IXIIII proves a lower bound for its success probability. Section IXIIl 
proves an upper bound for a much lager class of algorithms. The lower and upper bound are 
asymptotically similar. The number of tries T satisfies 



In T ~ max 

A>0 



d 



i=i 



(ID 



where F(Pi,X) is defined in (|49l . The similarity to the information theoretic © suggests that 
F(Pi, A) is some sort of information function. We call it the bucketing forest information 
function. [7] proves a similar estimate for the performance of the "best possible" bucketing 
algorithm, involving a bucketing information function with a very information theoretic 
look. 

Section IVIIII shows that our algorithm preserves sparseness. Section|IX] shows that dimensional 
reduction is bad for sparse data. 
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/ 



(12) 



II. The Homogeneous Marginally Bernoulli(1/2) Example 

The well known homogeneous marginally Bernoulli(l/2) example has been presented in the 
introduction. It will be analyzed in detail because the main purpose of this paper is generalizing 
it. The analysis is non-generalizable, but the issues remain. Recall that we have a joint probability 
matrix 

p/2 (l-p)/2 
[l-p)/2 p/2 

For some p > 1/2. Without restricting generality let n < n\. We randomly choose 

k « min[log 2 n , {2p — l)d] (13) 

out of the d coordinates. The reason for k < (2p — l)d (which was omitted in the introduction for 
simplicity) will emerge later. We compare point pairs which agree on the chosen k coordinates. 
This is a random algorithm solving a random problem, so we have two levels of randomness. 
Usually when we will compute probabilities or expectations it will be with respect to these 
two sources together. The expected number of comparisons is n n 1 2~ k while the probability of 
success of one comparison is p k . (These statements are true assuming only model randomness). 
In case of failure we try again, with other random k coordinates. In order to estimate the expected 
number of tries till success we have to enumerate how many bits are identical in the special pair 
Xp, Xi. Let this number be j. Then the probability of success in a single try conditioned on j is 
3 
k 




Hence the expected number of comparisons is Tri\ where 



T = n 2 



d 
3=0 



( 



d 



p>(\-p) 




I 



\ 



For small d/k this is too pessimistic because most of the contribution to the above sum comes 
from unlikely low j's. We know that with probability about 1/2, j > pd. Hence we get a success 
probability of about 1/2 with an expected 



T 




i/d 



Now it is clear that increasing k above (2p — l)d increases T, which is counterproductive. 
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III. An Intuitive Argument for the Marginally Bernoulli(1/2) Example 

In full generality our algorithm is not very intuitive. In this section we will present an intuitive 
argument for the special case of the joint probability matrices 

/ 



1 < i < d (14) 



Pi /2 {I -Pi) /2 
\ (I - Pi ) /2 P i/2 

The impatient reader may skip this and the next section, jumping directly to the algorithm. Let 
us order the coordinates in decreasing order of importance 

Pi > P2 > ■ ■ ■ > Pd (15) 

Moreover let us bunch coordinates together into g groups of d 1: d 2 , . . . , d g coordinates, where 
Yfh=i dh — d, and the members of group h all have the same probability q h 

p dl+ ... +dh _ 1+1 = ■ ■■ = p dl +-+d h = Qh (16) 

Out of the d h coordinates in group h, the special pair will agree in approximately q h d h 'good' 
coordinates. Let us make things simple by pretending that this is the exact value (never mind 
that it is not an integer). We want to choose 

A; = log 2 n (17) 

coordinates and compare pairs which agree on them. The greedy approach seems to choose 
as many as possible from the group 1, but conditional greed disagrees. Let us pick the first 
coordinate randomly from group 1. If it is bad, the whole try is lost. If it is good, group 1 is 
reduced to size di — 1, out of which q±di — 1 are good. Hence the probability that a remaining 
coordinate is good is reduced to 

d± — 1 

After taking m coordinates out of group 1, its probability decreases to 

qidi — m 



di — m 

Hence after taking 



(19) 



91 — <72 , 

m = d\ (20) 

1 - Q2 
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coordinates, group 1 merges with group 2. We will randomly chose coordinates from this merged 
group till its probability drops to g 3 . At that point the probability of a second group coordinate 
to be chosen is 

92 - 93 (21) 



1 - 93 

while the probability of a first group coordinate being picked either before or after the union is 

93 9i ~ 93 



9i - 92 / 1 9i - 92 \ 92 



(22) 



1-92 V 1 - 92 / 1 - 93 1 - 93 
This goes on till at some qi = p cut we have k coordinates. Then the probability that coordinate 
% is chosen is 



max 



Pi - Pcut 







1 - Pcut 

as stated in the introduction. The cutoff probability is determined by 

A r 

Pi — Pcut 



max 

i=i 



: ,0 



k 



(23) 



(24) 



1 - Pcut 

The previous equation can be iteratively solved. However it is better to look from a different 
angle. For each try we will have to generate d independent uniform [0. 1] random real numbers 

< n,r 2 ,...,r d < 1 (25) 

one random number per coordinate. Then we take coordinate i iff 

, Pi ~ Pcut tr\i-\ 

n < (26) 

1 - Pcut 

Let us reverse direction. Generate r« first, and then compute for which p cu ts coordinate % is 
taken: 



Pcut 



< 2 Xi = max 



Pi - Ti 



.0 



(27) 



Denoting the right hand side by 2~ Xi is unnecessarily cumbersome at this stage, but will make 
sense later. We will call A» the random exponent of coordinate i (random because it is r« 
dependent). Remember that p cut > so Aj = 00 means that for that value of coordinate % 
can not be used. Now which value of p cut will get us k coordinates? There is no need to solve 
equations. Sort the Aj's in nondecreasing order, and pick out the first k. Hence 

p cut = 2~ A - (28) 

where the cutoff exponent \ cut is the value of the k'th ordered random exponent. 
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It takes some time to comprehend the effect of equation (1271) . The random element seems over- 
whelming. The probability that coordinate 1 will have larger random exponent than coordinate 
2 when p\ > p 2 is 

ii^i (29) 

21 -p 2 

In particular the probability that a useless coordinate with pi = 0.5 precedes a good coordinate 
with pi = 0.9 is 0.1 ! However the chance that the useless coordinate will be ranked among the 
first k is very small, unless we have so little data that it is better to take k < lnn . 

IV. An Unlimited Homogeneous Data Example 

The previous section completely avoids an important aspect of the general problem which will 
be presented by the following example. Suppose we have an unlimited amount of data d — > oo 
of the same type 

Vnn Vm \ 

l<i<d (30) 



P 

where 



Pio Pn 



Poo + Poi + Pio + Pn = 1 (31) 



This is the joint probability of the dependent pair, and the marginal probabilities govern the 
distribution of the remaining points. In the set X the probability that bit i is is 

Po* = Poo + Poi (32) 

and similarly in X\ 

P*o = Poo + Pio (33) 

The * means "don't care". A reasonable pairing algorithm (very similar in this case to the general 
algorithm) is to pick coordinates at random 1 < ii, i 2 , . . . < d. After picking k coordinates, an 

X point xi = (xix, X12, • • • , xid) is in a bucket of expected size 

k 

n o\{Px lt * (34) 

t=i 

Hence it makes sense to increase k only up to the point where ^oll^iPau* < 1> an d men 
compare with all Xi points in its cell. This makes k point dependent. The expected number of 
comparisons in a single try is at most n\. What is the approximate success probability? 
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(37) 



V 



Our initial estimate was the following. The probability that the special pair will agree in a 
single coordinate is poo+Pu The amount of information in a single X coordinate is —po* hipo* — 
Pulnpu so we will need about 

k » , , (35) 

-po* hipo* -pi*mpi* 

coordinates, and the success probability is estimated by 

'"Ooo+Pii) 

(Poo + Pn) k ~ n P0 * lnp "* +p i* lnp i* (36) 
This estimate turns out to be disastrously wrong. For the bad matrix 

^ 1 - 2e e 
e 

with small e it suggests exponent — 1/ lne, while clearly it is worse than 1. The interested reader 
might pause to figure out what went wrong, and how this argument can be salvaged. 

There is an almost exact simple answer with a surprising flavor. We expect n^ A , so let us check 
that for consistency. Pick the first coordinate. With probability p 00 , the expectation n is reduced 
to uqPq*. With probability p n it is reduced to n p u , and with probability p 2 2 = 1 — Poo — Pn 
the try is already lost. Hence 

n x ~ Poo(noPo*)~ X + Pii(n Pi*)~ X (38) 
Happily n drops out, leaving us with 

PooP * X + PnPu X = 1 (39) 

which determines the exponent A. It is very easy to convert this informal argument into a formal 
theorem and proof. A harder task awaits us. 

V. The General Algorithm and its Performance 

Definition 5.1: The independent data model is the following. We generalize from bits to b 
discrete values. Let the sets 

X ,*iC{0,l,...,&-l} d (40) 

of cardinalities 

#X = n , #X X = m (41) 
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be randomly constructed in the following way. The X points are identically distributed indepen- 
dent Bernoulli random vectors, with pjj* denoting the probability that coordinate % has value j. 
There is a special pair of X , X± points, randomly chosen out of the tiqUi possibilities. For that 
pair the probability that both their i'th coordinates equal j is p^ 3 - with no dependency between 
coordinates. The rest of the X 1 points can be anything. (We abbreviate the usual notation p^ 
to pij, because we will consider only the diagonal and the marginal probabilities.) Denote 

6-1 
j=0 



Pi 



Pi,o Pi,i ■ ■ ■ Pi,b-i x 



(43) 



Pi,0* Pi,l* ■ ■ ■ Pi,b-1 * J 

We propose the following algorithm. It consists of several bucketing tries. For each try we 
generate d independent uniform [0, 1] random real numbers 

<r 1 ,r 2 ,...,r d < 1 (44) 

one random number per coordinate. For each coordinate % we define its random exponent 
\ > to be the unique solution of the monotone equation 

^ = 1 (45) 

or +oo when there is no solution, (pf^ means (pij*) Xl )- We lexicographically sort all the n + rii 
points, with lower exponent coordinates given precedence over larger exponent coordinates, and 
the coordinate values 0, 1, . . . , b— 1 arbitrarily arranged, even without consistency. Now each X x 
point is compared with the preceding a and following a X points (or fewer near the ends). The 
comparisons are done in some one-on-one way, and the algorithm is considered successful if it 
asks for the correct comparison. The best a is problem and computer dependent, but is never 
large. Each try makes at most 2arii comparisons. Of course there is extra n + rii point handling 
work. 

A nice way to write the lexicographic ordering of the algorithm follows. Suppose that in try 
t the sorted random exponents are 

< X n2 < ■■■ < (46) 

Then each point 

x = (x u x 2 ,...,x d ) e {0,1,...,6- l} d (47) 
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is projected into the interval [0, 1] by 

i=l " i=0 

The projection order is a lexicographic order. For large dimension d, Rt(x) is approximately 
uniformly distributed in [0, 1]. 

We will prove that the number of tries T needed for success satisfies 



InT ~ max 

A>0 



d 

Alnn -^F(P i; A) 

i=i 



(48) 



where 

Definition 5.2: The bucketing forest information function F(Pi, A) is 

h 

Pij 



F(Pi,\)= min J2Pi,j^—= (49) 

< q i>0 , q ijb j=0 %j 

ES=oftj = i 

E6— 1 i 
i=o pT - l 



max 

0<r. 



•7=0 \ Pi,j* 

The two dual extrema points are related by 



ax Pi J In f 1 - n + n J ' f J (50) 



1_^.^.(2^) (51) 



1 - Ti + n-x 



I, J* 



For E -=o |H- < 1 = 0, = PiJ , F(P is A) = 0. Otherwise 



6-1 „ 6-1 



Q'iji Pi 

j=0 Pi,j* j=0 0- ~~ T i)P i,j* r i 

We will get (|49l from the upper bound theorem, and (l50l) from the lower bound theorem. Their 



y = y = i (52) 



equivalence is a simple (though a bit surprising) application of Lagrange multipliers in a convex 
setting. Representation (|50l ) implies that F(P, A) is an increasing convex function of A. 
The cutoff exponent \ cut attains (|48l) . It has several meanings. 

1) In each try the coordinates with Aj < \ cut define a bucket of size e en ° for some small real 
e. 

2) If we double n the number of tries needed to achieve success probability 1/2 is approx- 
imately multiplied by 2 Xcut . 
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3) If we delete coordinate i, then the number of tries needed to achieve success probability 



Switching X and X\ may result in a different algorithm. Coordinate values can be changed 
and/or merged in possibly different ways for X , X\. For each possibility we have an estimate 
of its effectiveness, and the best should be taken. 

In real applications there is dependence, and the probabilities have to be estimated. Our 
practical experience indicates that this is a robust algorithm. Details will be described in another 
paper. 



There is an interesting alternative to the random ordering of coordinates. Suppose we have 
training sets X , Xi both of size n, such that each X point is paired with a known X 1 point. Let 
us estimate the probabilities Pj by their empirical averages. For each coordinate i its exponent 
Xi > is defined by 



Arrange the coordinates in the greedy order of nondecreasing exponents. Perform the first try 
using that order just like in the previous algorithm. Remove the pairs found from the training 
data, and repeat recursively on the reduced training data. Stop after the training set is reduced 
to 1/3 (for example) of its original size, or you run out of memory. The memory problem can 
be alleviated by keeping only the heads of coordinate lists, and/or running training and working 
tries in parallel. 

This simpler algorithm has a more complicated and/or less efficient implementation, and lacks 
theory. 



1/2 is on average multiplied by e Fi ^ Xcut \ 



VI. An Alternative Algorithm 




1 



(53) 



VII. Return of the Marginally Bernoulli(1/2) Example 




(54) 



which can be recast as the familiar 



2 -Az _ P* T _i 

1 ~Ti 



(55) 
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F(Pi, A) = ^ 



The bucketing forest information function is 

Pihi ^ + {1- Pi ) In Pi>2- X 
pi < 2- x 

The cutoff exponent attains (|48l) . The extremal condition is the familiar 

Pi - 2~ Xcut 



d 

V" max 

i=l 







Let us now specialize to p\ = p 2 



A 



cut 



■ = Pd = P- Then 

pd - log 2 n 

■ ~ log 2 



log 2 n 



(56) 



(57) 



d - log 2 n 

Notice that log 2 n > (2p — l)d is equivalent to A cut > 1. In general \ cut > 1 signals that the 
available bucketing forest information is of such low quality that the trees are worse than random 
near their leafs. 



VIII. Sparsity 
Let us specialize to sparse bits: 6 = 2, 

Pi,i*,Pi, +Pi,n « 1 
We will also assume that for some fixed 5 > 



Pt,u > $ (Pifll + Pi, 



w) 



The equation 



Pi,00 Pi,U 



(58) 



(59) 



(60) 



has two asymptotic regimes: one in which p Xi ^ is nearly constant and p Xi u changes, and vice 
versa. The first regime is the important one: 

PiXX 



PiflQ + 



i 



(61) 
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In practice the probabilities have to be estimated from the data, and sparse estimates must be 
unreliable, so we used the more conservative 

1/Ai = (1 - rO— ^ In — (63) 

Pifll + Pi,W Pi,l* 

A very important practical point is that the general algorithm preserves sparsity. Suppose that 
instead of points 

x = (x 1 ,x 2 ,...,x d ) E {0,l} d (64) 

we have subsets of a features set D of cardinality d : 

D X CD (65) 

In try t we use a hash function hash t : D — > [0, 1]. For each feature % E D its random exponent 
Aj is computed using the pseudo random 

Ti = hashj(i) (66) 
and the random exponents of x are sorted 

K! < K 2 < ■ ■ ■ < K v (67) 

Then the sequence of features 

(7ri,7r 2 ,...,7r„) (68) 
is a sparse representation of x whose lexicographic order is used in try t. 

IX. The Downside of Dimensionality Reduction 

Another way of handling sparse approximate neighbor problems is to convert them into dense 
problems by a random projection. For dense problems taking some k out of the d coordinates 
can be an effective way to reduce dimension. For sparse problems such a sampling reduction 
will remain sparse, hence dense projection matrices are used instead. We will show that this 
can result in a much worse algorithm. Let us consider the unlimited homogeneous data example 
with 

Poi = Pio, no = ni = n (69) 
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because in general it is not clear which projections to take and how to analyze their performance. 
We have a d dimensional Hamming cube {0, l} d . The Hamming distance between two random 
X , Xi points is approximately 

2p *Pi*d (70) 
The Hamming distance between the two special points is approximately 

2 Po ,id (71) 

Hence when the dimension d is large, the random to special distances ratio tends to 

Po*Pi* , n ~, 
c = (72) 

Poi 

The ideal dimensionality reduction would be to project {0, l} d into a much lower dimensional 
{0, l} fc in such a way that the images of the X ,Xi points are random {0, l} k points, and the 
distance between the two special images is approximately k/2c (k/2 is the approximate distance 
between two random image points). Hence after the dimensionality reduction we will have a 
homogeneous marginally Bernoulli(l/2) problem with 

p = 1 - l/2c (73) 

The standard nearest neighbor algorithm solves this in approximately 

n log2 (74) 

tries. Actual dimensional reductions fall short of this ideal. The Indyk and Motwani theory [5] 
states that 

n 1/c (75) 

tries suffice. The truth is somewhere in between. 

In contrast without dimensionality reduction our algorithm takes approximately n x tries where 
A is determined by 

1 - Pi* - Poi + Pu =1 (76) 
(1-Pi*) x Pi* 
In the asymptotic region (1581591) inserting r = into (|62l) results in 



In 

A p 



^ I 2ppi 

Pll 



ln^±i 

c - x (77) 



ml/pi, lnl/pi* 

We encourage the interested reader to look at his favorite dimensional reduction scheme, and 
see that the In 1/pi* factor is really lost. 
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X. Lexicographic and Bucketing Forests 
Our general algorithm is of the following type. 

Definition 10.1: A lexicographic tree algorithm is the following. The d coordinates are ar- 
ranged according to some permutation. Than a complete lexicographic ordered tree is generated. 
It is defined recursively as a root pointing towards b subtrees, with the edges denoting the possible 
values of the first (after permutation) coordinate arbitrarily ordered. The subtrees are complete 
lexicographic ordered trees for the remaining d—1 coordinates. In particular the lexicographic tree 
has b d ordered leafs, each denoting a point in {0,1, — l} d . A lexicographic tree algorithm 
arranges the n + n\ X U X 1 points according to the tree, and then compares each x\ point 
with its a neighbors right and left. This insures no more than 2arii comparisons per tree. A 
lexicographic forest is simply a forest of lexicographic trees, each having its own permutation. 
It succeeds iff at least one tree succeeds. 
An obvious generalization is 

Definition 10.2: A semi-lexicographic tree algorithm has a 'first' coordinate and then recur- 
sively each subtree is semi-lexicographic, until all coordinates are exhausted. 
For example we can start with coordinate 3, and than consider coordinate 5 if the value is 0, or 
coordinate 2 if the value is 1 and so on. 

The success probability of a lexicographic forest is very complicated,even before randomizing 
the algorithm. For that reason we will consider an uglier non-robust class of algorithms that are 
easier to understand and analyze. 

Definition 10.3: A bucketing tree algorithm is predictably recursively defined. Either compare 
all pairs (a leaf bucket), or take one coordinate, split the data into b parts according to its value 
(some parts may be empty), and apply a bucketing tree algorithm on each part separately. In 
order to have no more than an expected comparisons we will insist that each leaf expects no 
more than a points belonging to X . A bucketing forest is simply a forest of bucketing trees. It 
succeeds iff at least one tree succeeds. 

The success probability of a bucketing forest is no bed of roses. Let us denote a leaf by 
w G {0, 1, ... , b} d , with b indicating that the corresponding coordinate is not taken. The leaf w 
expects 

Pi, Wi * Wi<b 



d 

noil 

i=i 



(78) 

Wi = b 
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X points, and its success probability is 



n 



p 



Wi < b 



1 Wi = b 



(79) 



The success probability of a tree is the sum of the success probabilities of its leafs. The success 
probability of the whole forest is less than the tree sum. Suppose the whole forest contains L 
leafs w 1 ,w 2 , • • • , w L . Let ?/ e {0,1,..., b} d denote the abbreviated state of the special points: 



Vi 



b x ,i ^ x hi 



(80) 

Z)S=oPi,j- The success 



The value b denotes disagreement and its probability is p i>b = '. 
probability of the whole forest is 

d 

s = E IlPiM- 

j/e{0,l,...,6} d i=l 

• 1 - II f 1 - Ii( W l,i ==Vi II W l,i == b )] 
1=1 \ i=l / . 

Remember that (wi ti == yi \ \ wi :i == b) = 0, 1 hence the two rightmost products are just logical 
ands, and 1 — () is a logical not. 

Our algorithm is almost a bucketing forest, except that the leaf condition is data dependent 
(for robustness). A truly variable scheme can shape the buckets in a more complicated data 
dependent way, see for example Gennaro Savino and Zezula [4]. Non-tree bucketing can use 
several coordinates together, so that the resulting buckets are not boxes, see for example Andoni 
and Indyk [2] or [7]. 



XI. A Bucketing Forest Upper Bound 

In this section we will bound the performance of bucketing forest algorithms. It is tricky, but 
technically simpler and more elegant than proving a lower bound on the performance of a single 
algorithm. 

Theorem 11.1: Assume the independent data model. The success probability P of a nonempty 
bucketing tree whose leafs all have probabilities at most l/N is at most 

d ( 6-1 



P 



i=l 



j=0 Pi,j* 



h3 
A 



(81) 
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for any A > 0. We do not even have to assume < Pij*. 

Proof: Use induction. Without losing generality split coordinate 1. The induction step 

b-l d ( b-1 . \ 

p < e pij (N P i, 3 *r x n max i. e zrM (82) 

j=0 t=2 \ j=0 J* / 

is valid for both proper and point-only subtrees. The maximization with 1 is necessary because 
coordinates can be ignored. ■ 
Theorem 11.2: Assume the independent data model. Suppose an bucketing forest contains T 
trees, its success probability is S, and all its leafs have probabilities at most 1/N. Than for any 
A > 



InT > AlniV + m- 



2 

where 



i=l 



V(Pi,X) = E Pu ( In — - E ln — ) (83) 

j=0 \ %3 k=0 



and the q^/s are the minimizing arguments from F's definition (1491 ) 

Proof: The previous theorem provides a good bound for the success probability of a single 
tree, but it is not tight for a forest, because of dependence: the failure of each tree increases the 
failure probability of other trees. Now comes an interesting argument. Recall that the success 
probability of the whole forest formula (|8TT) . For any z and q it j > we can bound 



where 



We insist upon 



S < Prob{Z >z} + e z S Q (84) 



Z = Y^\ n ^ (85) 

i=l Qi,yi 

prob{z>z}= e Yipn»-(i:^>z) 

ye{0,l,...,b} di=1 ^ l=1 ^' Vi ' 

d 

s q= e n %yi ■ 

j/6{0,l,...,6} d *=1 

L I d \ 

1 - n 1 - ik^m == y< ii w u == b ) 

1=1 \ i=l / 
b 

E^i = 1 (86) 

j=0 
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so that we can use the previous lemma to bound 



S Q < TP q < TN- X f[ max f 1, ]T ] (87) 

1=1 V j=0 Pi,j* 



The other term is handled by the Chebyshev bound: for z > E(Z) 



Together 



The reasonable choice of 



results in 



Var(Z) 
(z-E(Z)) 1 



^ < t ~7~~2 + e^ Q (89) 



= E(Z) + yj2Vai(Z)/S (90) 



5 < 2e E ( z )+v /2V -(^)/ s 1 SQ (91) 



Notice that this proof gives no indication that the bound is tight, nor guidance towards 
constructing an actual bucketing forest, (except for telling which coordinates to throw away). 

We tried to strengthen the theorem in the following way. Instead of restricting the expected 
number of points falling into each leaf bucket, allow larger leafs and only insist that the total 
number of comparisons is at most aN. Surprisingly the strengthened statement is wrong, and 
a Targe leafs' bucketing forest is theoretically better than our algorithm. But it is complicated 
and non-robust. 

XII. A Semi-Lexicographic Forest Upper Bound 

There remains the problem that we gave a lexicographic forest algorithm, but a bucketing 
forest upper bound. It is a technicality, which may be skipped over with little loss. Any semi- 
lexicographic complete tree can be converted into a bucketing tree in an obvious way: Prune the 
complete tree from the leafs down as much as possible, preserving the property that each leaf 
expects at most a/2 points from X . The success probability of the semi-lexicographic tree is 
bounded by 

P < P tree + R (92) 
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where P tree is the success probability of the truncated tree, for which we have a good bound, and 
a remainder term associated with truncated tree vertexes expecting more than a/2 tree points. 

Lemma 12.1: Assume the independent data model and consider a semi-lexicographic tree with 
the standard coordinate order (that does not restrict generality) and a totally random values order. 
Assume that the special points pair agree in coordinates 1, 2, . . . , % — 1, but disagree at coordinate 
i : 

2/i, 3/2, • • • ,Vi-i ^b, y { = b (93) 



Conditioning on that, the probability of success is at most 

2a 



n Pl,ylP2,y 2 ■ ■ -Pi-lM-i 

Proof: Denote 



(94) 



P = Pl,ylP2,V2---Pi-l,Vi-i ( 95 ) 

Let m be the number of X points agreeing with the special pair in their first i — \ coordinates. 
Its probability distribution is l+Bernoulli(p, n — 1). Let us consider these m points ordered 
by the algorithm. The rank of the special X point can be 1,2, ... , m with equal probabilities. 
Those m ordered points are broken up into up to b intervals according to the value of coordinate 
i. Where does the special Xi point fit in? It is in a different interval than the X special point, 
but its location in that interval, and the order of intervals is random. Hence the probability that 
the two special points are at most a + 1 apart is at most 2a/ m. This has to be averaged: 

n / 



E 

m=l 



U ' 1 Lr-i(l-p)«-2a = 2a (%) 
m-1 m np 



Theorem 12.2: Assume the independent data model. Then the success probability of any semi- 
lexicographic tree with a totally random coordinate values order is at most 

Jy i=l \ j=0ri,j*J 

for any < A < 1, where 

JV = max (l, (98) 
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Proof: Without restricting generality assume that the coordinate have the standard order. 
We have established that 

a<t<i " i =' p, «- V >-<> 

< Wi,W2, . . . , Wt < b 

The negative terms can be shifted to the next t : 

l<t<d 
< wi,W2, . . . ,w t < b 



Denote 



Rw 1 ,...,Wg — n „ ' Pt,VH*J 

i=s+l Pi,Wi* 

s <t< d 



< w s+ i, w s +2, ■■■,w t <b 

We will prove by induction from the leafs down that 



B < iV 1_A lnfeiV V (99) 

±L iu\,voi,...,w s -i J v u;i,...,«) s 111 y 1 ' 1 y wi,...,w s -iJ \s J ) 

• n ma, 1,E^) (100) 

i=s+l \ j=0 / 

where 

8=1 

The induction step boils down to 

In (bN^^ ...jWs—1 J — (1 Ps,W s *) 1^ (^-^Wl,...,W s — iPs,w s *j 

which is obviously true. ■ 
Theorem (II 1.2b is converted into 

Theorem 12.3: Assume the independent data model. Suppose a semi-lexicographic forest with 
a totally random coordinate values order contains T trees, its success probability is S, and 

N = max (l, — ) (102) 
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Than for any < A < 1 



lnT> AlniV-ln [2 In (e 4 ' 5 iv)] + 



z \ D i=l i=l 
XIII. A Lower Bound 
Theorem 13.1: Assume the independent data model and denote 

a 

Let e > be some small parameter, and let Let A, ri, r 2 , . . . , r rf attain 

(1 + e)AlniV + 



mm max 

A>0 0<r u ...,r d <l 



+ ( 1 - r i + 



i=l j=0 



Pi,j* / . 



The extrema conditions are 



d b-i — r-lnn- • 

E£^ (1 /;.;f j ;, =(i+e)iniv 

i=i j=o V 1 1 i)Pi,]* ~ ' i 



and r, = or 



6-1 



^ (1 - r,W,, + n 



Ki<d 



3=0 



Suppose that for some 5 < 1/7 

d b 1 

E E Pi,3 Ml - n + {j + b)r ip -^} - 

i=l j=0 \ 

- £ ft,* ln[l -r, t + (k^ b)r lP -l] < e 2 5\ 2 (In Nf 



d b ~ l ( —r \nv • 

i=l j=0 



.(1 -n)Pi,j* + r { 



,k* 



t4 -rj lnpi,, 

C 1 - ^Kfc* + n 



< e 2 <5(lnAT) 2 /4 



ci 6-1 

EEp 

i=l j=0 



(! - r i)Pi,j* + r 



1 2 



< e^QniV) /8 
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Then the general algorithm with T tries where 

1 



has success probability 



lnT > In- + (1 + 3e)AlniV- 
o 

i=lj=0 \ fi,j* , 



(111) 
(112) 

(113) 



Moreover there exists a bucketing forest with T trees and at least 1 — 75 success probability. 

The alarmingly complicated small variance conditions are asymptotically valid, because the 
variances grow linearly with In N. However there is no guarantee that they can be always met. 
Indeed the upper bound is of the Chernof inequality large deviation type, and can be a poor 
estimate in pathological cases. 

Definition 13.1: Let Y, Z be joint random variables. We denote by Y z the conditional type 
random variable Y with its probability density multiplied by 

(114) 



E[e 2 



In the discrete case Z, Y would have values y u zi with probability p it Then Y z has values iji 
with probability 

^ (115) 



Lemma 13.2: For any random variable Z, and A > 



InProb {Z > E [Z xz ]} < InE [e xz \ - AE [Z xz 
In Prob \z > E [Z xz \ - ^2Var [Z xz ]j > 



(116) 



(117) 



> InE 



- AE [Z xz ] - In 2 - A v /2Var [Z xz ] (118) 
Proof: The upper bound is the Chernof bound. The lower bound combines the Chebyshev 
inequality 

Prob j|Z AZ -E[Z AZ ]| < v /2Var[Z AZ ]| > 1 (119) 

with the fact that the condition in the curly bracket bounds the densities ratio: 

e 



In — r —■, = In ■ 



< 



< - InE 



E [e xz ] " E [e xz j 
+ AE [Zxz] + A^/2Var [Z xz 



(120) 
(121) 
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It is amusing, and sometimes useful to note that 

<91nE 



V[Z XZ ] = L (122) 



XZl 



Var[Z AZ ] = (123) 

We will now prove the theorem I13.ll Proof: Let A > be a parameter to be optimized. 
Let w E {0, l} d be the random Bernoulli vector 

wt = (Ai < A) (124) 

where Aj is the i'th random exponent. In a slight abuse of notation let < < 1 denote not a 
random variable but a probability 

Ti = Prob{w, == 1} = Prob{Ai < A} (125) 

We could not resist doing that because equation (1451 ) is still valid under this interpretation. 



Another point of view is to forget (1451) and consider ri a parameter to be optimized. Again let 
y E {0, 1, . . . , b} d denote the abbreviated state of the special points x ,Xi. Let us consider a 

single try of our algorithm, conditioned on both y and w. The following requirements 

d 

Y[(l - Wi + Wiivi ^ b)) = 1 (126) 

i=i 

d la 

J](l -Wi + w&iw) <n = 7^- q ( 127 ) 

state that the expected number of X points in the bucket defined by the coordinates whose 
Wi — 1 with value yi is at most a/2. Then the probability that the actual number of bucket 
points is more than a is bounded from above by 1/2. A more compact way of stating (11261) and 
(11271) together is 

Z(y,w)>\nN (128) 

d 

Z{y, w) = ^ In [l - Wi + Wi( yi + b)p~^] (129) 
i=i 

Summing over w gives success probability of a single try, conditioned over y to be at least 

1 d 

P(y)>~ E nP- u '<)( 1 - r <)+wl[%«')^ llliV ] 

w£{0,l} d i=l 
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the reverse Chernof inequality can be rewritten as 

lnProbjzfe) > V{y) - pW{y)} > (138) 



> U(y) - XV(y) - In 2 - XpW(y) (139) 
It is time for the second inequality tier. For any 5 < 1/3 

Prob{|£%) - E[U]\ < yJVax[U]/5, (140) 

\V(y)-E[V}\ < ^r[V}/5, (141) 

W{y) < E[W}/5) > 1 - 35 (142) 



where 



d b 



E[t/] = E E PW - + (J &)r<ft£] (143) 



i=l i=o 

<2 6-1 



Hence 



lnProbjz(y) > E[V] - ^Var[F]/5- ^/2E[W]/<5} > 



> E[U] - XE[V] - In 2 - JVar[?7]/5 



-Ay Var[V]/(J - A v /2E[W]/5 
Now we have to pull all strings together. In order to connect with (11301) we will require 

E[V] = (l + e)lniV (145) 



'Vax[V] + y2E[iy] < e5 1/2 In JV (146) 
for some small e > 0. Recalling (1135k condition (11451 ) is achieved by choosing A to attain 



mm[-(l + e)AlniV + E[£7]] (147) 



If (fT46l) holds, then 



In P(y) > -(l + 2e)AlnJV + E[C7] - In 4- y/Vai[U]/8 (148) 
with probability at least 1 — 35. Recalling (11331 ) the success probability is at least 

S > j-g - (149) 

1 + 4 e (l+2e)Aln7V-E[C/] +A /Var[C/]/5 irp 
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XIV. Conclusion 
To sum up, we present three things: 

1) An approximate nearest neighbor algorithm (1451) . and its sparse approximation (|63l) . 

2) An information style performance estimate (|48l) . 

3) A warning against dimensional reduction of sparse data, see section ITXl 
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